**Context**

This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

**Content**

It contains the following 6 fields:

**target**: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

**ids**: the id of the tweet (2087)

**date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

**flag**: the query (lyx). If there is no query, this value is NO_QUERY.

**user**: the user that tweeted (robotickilldozr)

**text**: the text of the tweet (Lyx is cool)

In [29]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 1: Read The Data

In [30]:
import pandas as pd
import re

import warnings
warnings.filterwarnings('ignore')

In [31]:
columns=['target','ids','date','flag','user','text']

In [32]:
path='/content/drive/MyDrive/NLP/Tweet.csv'
df=pd.read_csv(path,encoding='ISO-8859-1',names=columns)
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [33]:
df.tail(1)

Unnamed: 0,target,ids,date,flag,user,text
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


In [34]:
dataset=df[['text','target']].copy()   # copy so later column assignments do not act on a view of df
dataset.head()

Unnamed: 0,text,target
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0


## Step 2: Remap The Target Column

In [35]:
dataset.target.unique()

array([0, 4])

In [36]:
dataset['target']=dataset['target'].replace(4,1)
dataset.target.unique()

array([0, 1])

## Step 3: Handling The Missing Values

In [37]:
dataset.isna().sum()

text      0
target    0
dtype: int64

## Step 4: Text Preprocessing

### Step 4.1: Remove URLs

In [38]:
str(dataset['text'][0])

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [39]:
pattern=re.compile(r'http[s]?:\/\/\S+')
pattern.sub(r'',str(dataset['text'][0]))

"@switchfoot  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [40]:
def remove_url(text):
  pattern=re.compile(r'http[s]?:\/\/\S+')
  return pattern.sub(r'',text)


In [41]:
dataset['text']=dataset['text'].apply(lambda x: remove_url(x))

In [42]:
dataset['text'].head()

0    @switchfoot  - Awww, that's a bummer.  You sho...
1    is upset that he can't update his Facebook by ...
2    @Kenichan I dived many times for the ball. Man...
3      my whole body feels itchy and like its on fire 
4    @nationwideclass no, it's not behaving at all....
Name: text, dtype: object
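The pattern used above only matches links that start with `http://` or `https://`. A minimal sketch, assuming bare `www.` links should be dropped as well; the name `remove_url_loose` is hypothetical and not part of the notebook:

```python
import re

# Hypothetical variant: also matches links that start with "www."
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def remove_url_loose(text):
    """Strip http(s):// links and bare www. links from a string."""
    return URL_PATTERN.sub('', text)

print(remove_url_loose('see www.example.com and http://twitpic.com/2y1zl now'))
```

As in `remove_url`, the surrounding spaces are left in place, so a later whitespace-normalizing step may still be useful.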

### Step 4.2: Remove HTML Tags

In [43]:
def remove_tags(text):
  pattern=re.compile(r'<.*?>')
  return pattern.sub(r'',text)

In [44]:
text='<p>Save the document by pressing <kbd>Ctrl + S</kbd></p>'
remove_tags(text)

'Save the document by pressing Ctrl + S'

In [45]:
dataset['text']=dataset['text'].apply(lambda x: remove_tags(x))

### Step 4.3: Handling Emoticons

In [46]:
# Emoticons mapped to descriptive words
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad',
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed',
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink',
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat',';D':'laughing'}


In [47]:
'Emoji'+emojis[':)']

'Emojismile'

In [48]:
def remove_emoticons(text):
  for emoji in emojis:
    text=text.replace(emoji,'Emoji'+emojis[emoji])

  return text


In [49]:
dataset['text']=dataset['text'].apply(lambda x: remove_emoticons(x))
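Because `remove_emoticons` walks the dictionary in insertion order, a short emoticon such as `:-)` is replaced before a longer one that contains it, such as `O:-)`, so the longer entry never gets a chance to match. A minimal sketch of one way around this, trying longer emoticons first; the table below is a hypothetical subset and `replace_emoticons_longest_first` is not part of the notebook:

```python
# Hypothetical subset of the emoticon table, for illustration only
emoticons = {':)': 'smile', ':-)': 'smile', 'O:-)': 'angel', ';D': 'laughing'}

def replace_emoticons_longest_first(text, table=emoticons):
    """Replace emoticons, trying longer patterns before their substrings."""
    for emo in sorted(table, key=len, reverse=True):
        text = text.replace(emo, 'Emoji' + table[emo])
    return text

print(replace_emoticons_longest_first('good night O:-)'))
print(replace_emoticons_longest_first('thanks :-)'))
```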

In [50]:
dataset['text'][0]

"@switchfoot  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. Emojilaughing"

### Step 4.4: Handling Emojis

In [51]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.9.0-py2.py3-none-any.whl (397 kB)
Installing collected packages: emoji
Successfully installed emoji-2.9.0


In [52]:
text='Business: We open at 10. 😀'

import emoji
print(type(emoji.demojize(text)))

<class 'str'>


In [53]:
def remove_emoji(text):
  return emoji.demojize(text)

In [54]:
remove_emoji(''' Business: Hi Jane, it looks like order X25D has been delayed for 2 days due to a backup in the factory. 😞''')

' Business: Hi Jane, it looks like order X25D has been delayed for 2 days due to a backup in the factory. :disappointed_face:'

In [55]:
dataset['text']=dataset['text'].apply(lambda x: remove_emoji(x))

In [56]:
dataset['text'][0]

"@switchfoot  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. Emojilaughing"

### Step 4.5: Handling User Mentions

In [57]:
def handle_user(text):
  pattern=re.compile(r'@[^\s]+')
  text=pattern.sub('Tuser',text)

  return text

In [58]:
handle_user(dataset['text'][0])

"Tuser  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. Emojilaughing"

In [59]:
dataset['text']=dataset['text'].apply(lambda x: handle_user(x))
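The pattern `@[^\s]+` replaces everything from an `@` up to the next whitespace, which also catches the tail of an e-mail address and any punctuation glued to a mention. A minimal sketch of a stricter variant, assuming mentions consist of word characters only and must not be preceded by a word character; `handle_user_strict` is a hypothetical name:

```python
import re

# '@' must not follow a word character, and the handle is word characters only
MENTION = re.compile(r'(?<!\w)@\w+')

def handle_user_strict(text):
    """Replace @mentions with the placeholder 'Tuser', leaving e-mail addresses alone."""
    return MENTION.sub('Tuser', text)

print(handle_user_strict('@switchfoot mail me at bob@example.com, thanks @Kenichan!'))
```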

### Step 4.6: Remove Punctuation

In [60]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [61]:
punc=string.punctuation

def remove_punc(text):
  return text.translate(str.maketrans("","",punc))

In [62]:
remove_punc('Hi! How are you?')

'Hi How are you'

In [63]:
dataset['text']=dataset['text'].apply(lambda x: remove_punc(x))
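Deleting punctuation outright glues together words that were separated only by punctuation (for example `late!Thanks` becomes `lateThanks`). A minimal sketch of an alternative that maps each punctuation character to a space and then collapses repeated whitespace; `punc_to_space` is a hypothetical helper, and the trade-off is that contractions such as `can't` become `can t`:

```python
import string

# Map every punctuation character to a space instead of deleting it
PUNC_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punc_to_space(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return ' '.join(text.translate(PUNC_TO_SPACE).split())

print(punc_to_space('great...but late!Thanks'))
```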

### Step 4.7: Replace Chat Words and Slang

In [64]:
slang='/content/drive/MyDrive/NLP/slang.txt'

In [65]:
slang

'/content/drive/MyDrive/NLP/slang.txt'

In [66]:
with open(slang,'r') as f:
  lines=f.readlines()

In [67]:
lines[0]

'AFAIK=As Far As I Know\n'

In [68]:
lines[0].split('=')

['AFAIK', 'As Far As I Know\n']

In [69]:
lines[0].split('=')[0]

'AFAIK'

In [70]:
lines[0].split('=')[1][:-1]

'As Far As I Know'

In [71]:
slang_dict={}
for line in lines:
  # split on the first '=' only, and strip the trailing newline/whitespace
  if '=' in line:
    key,value=line.split('=',1)
    slang_dict[key.strip()]=value.strip()


In [72]:
slang_dict

{'AFAIK': 'As Far As I Know',
 'AFK': 'Away From Keyboard',
 'ASAP': 'As Soon As Possible',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'A3': 'Anytime, Anywhere, Anyplace',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRT': 'Be Right There',
 'BTW': 'By The Way',
 'B4': 'Before',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GN': 'Good Night',
 'GMTA': 'Great Minds Think Alike',
 'GR8': 'Great!',
 'G9': 'Genius',
 'IC': 'I See',
 'ICQ': 'I Seek you (also a chat program)',
 'ILU': 'ILU: I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'KISS': 'Keep It Simple, Stupid',
 'LDR': 'Long Distance Relationship',
 'LM

In [73]:
def remove_chatwords(text):
  new_text=[]
  for w in text.split():
    if w.upper() in slang_dict:
      new_text.append(slang_dict[w.upper()])
    else:
      new_text.append(w)

  return " ".join(new_text)


In [74]:
remove_chatwords('rofl ! This is so funny')

'Rolling On The Floor Laughing ! This is so funny'

In [75]:
dataset['text']=dataset['text'].apply(lambda x: remove_chatwords(x))

### Step 4.8: Make Lower Case

In [76]:
dataset['text']=dataset['text'].str.lower()

### Step 4.9: Spelling Correction

In [77]:
# ! pip install textblob

In [78]:
from textblob import TextBlob

str(TextBlob('I luve Honey').correct())

'I love Money'

In [79]:
text='Thise is not treu'
tl=text.split()

In [80]:
" ".join([str(TextBlob(i).correct()) for i in tl])

'Hise is not true'

In [81]:
!pip install autocorrect

Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... done
  Created wheel for autocorrect: filename=autocorrect-2.6.1-py3-none-any.whl size=622364 sha256=e04fa72a0fa5ccbb4ce5ff5033b26b58b4df4ebdd7ec31e5aa11e8c01989e9bc
  Stored in directory: /root/.cache/pip/wheels/b5/7b/6d/b76b29ce11ff8e2521c8c7dd0e5bfee4fb1789d76193124343
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.6.1


In [82]:
from autocorrect import Speller

spell = Speller(lang='en')

print([spell(i) for i in tl])

['This', 'is', 'not', 'true']


In [83]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.0-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.0


In [84]:
from spellchecker import SpellChecker

In [85]:
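spell = SpellChecker()   # build the checker once instead of on every call

def spell_correct(text):
  tl=text.split()
  # correction() can return None when no candidate is found; keep the original word then
  return " ".join([spell.correction(i) or i for i in tl])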


In [86]:
spell_correct('Thes is not my shurt')

'the is not my hurt'

**Note: None of the spell-correction modules tried above gave reliable results, so we do not apply spelling correction to the dataset.**

### Step 4.10: Tokenization

In [87]:
!pip install nltk



In [88]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [89]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [90]:
sent_tokenize('''It uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list. Those words that are found more often in the frequency list are more likely the correct results.''')

['It uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word.',
 'It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list.',
 'Those words that are found more often in the frequency list are more likely the correct results.']
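The passage tokenized above describes how an edit-distance spell checker picks its corrections. A minimal sketch of that idea follows; it is an illustration, not the implementation of any of the libraries tried in Step 4.9, and the word-frequency table `WORD_FREQ` is a tiny hypothetical stand-in for a real frequency list:

```python
import string

# Tiny hypothetical word-frequency list; real checkers ship a much larger one
WORD_FREQ = {'this': 120, 'is': 300, 'not': 200, 'true': 50, 'tree': 40, 'shirt': 10}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the candidates that appear in the frequency list."""
    return {w for w in words if w in WORD_FREQ}

def correct(word):
    """Prefer the word itself, then distance-1, then distance-2 candidates."""
    word = word.lower()
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    # Among the candidates, the most frequent word wins
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))

print([correct(w) for w in 'Thise is not treu'.split()])
```

With a small or mismatched frequency list the "most frequent candidate" rule can easily pick the wrong word, which is one plausible reason for results such as `'I love Money'` on informal tweet text.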

In [91]:
type(word_tokenize('I Love Pizza'))

list

In [92]:
def word_token(text):
  return word_tokenize(text)


In [93]:
dataset_copy=dataset.copy()

In [94]:
dataset_copy.head()

Unnamed: 0,text,target
0,tuser awww thats a bummer you shoulda got davi...,0
1,is upset that he cant update his facebook by t...,0
2,tuser i dived many times for the ball managed ...,0
3,my whole body feels itchy and like its on fire,0
4,tuser no its not behaving at all im mad why am...,0


In [95]:
dataset['text']=dataset['text'].apply(lambda x: word_token(x))

### Step 4.11: Stop Word Removal

In [96]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [97]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [98]:
print(len(stopwords.words('english')))

179


In [99]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [100]:
stop_w=stopwords.words('english')

In [101]:
text_list=word_tokenize('i love pizza')
clean_text=[word for word in text_list if word not in stop_w]

In [102]:
from functools import lru_cache

# Build the stop-word set once; set membership is much faster than scanning a list
stop_set=set(stopwords.words('english'))

@lru_cache(maxsize=50000)
def remove_stopword(text):
  text_list=text.split()
  clean_text=[word for word in text_list if word not in stop_set]
  return clean_text

In [103]:
remove_stopword('i love pizza')

['love', 'pizza']
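The English stop-word list printed above includes `no`, `nor` and `not`. Dropping them turns "not happy" into "happy", which may matter for sentiment. A minimal sketch of keeping negations, using a hypothetical mini stop-word set in place of the NLTK list; `remove_stopword_keep_negation` is not part of the notebook:

```python
# Hypothetical mini stop-word list; in the notebook this would come from
# stopwords.words('english'), which also contains 'no', 'nor' and 'not'
STOP = {'i', 'am', 'is', 'the', 'a', 'not', 'no', 'nor', 'this'}
NEGATIONS = {'no', 'nor', 'not'}
STOP_KEEP_NEG = STOP - NEGATIONS

def remove_stopword_keep_negation(text):
    """Drop stop words but keep negation words."""
    return [w for w in text.split() if w not in STOP_KEEP_NEG]

print(remove_stopword_keep_negation('i am not happy'))
```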

In [104]:
dataset=dataset_copy.copy()

In [105]:
dataset.head()

Unnamed: 0,text,target
0,tuser awww thats a bummer you shoulda got davi...,0
1,is upset that he cant update his facebook by t...,0
2,tuser i dived many times for the ball managed ...,0
3,my whole body feels itchy and like its on fire,0
4,tuser no its not behaving at all im mad why am...,0


In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_stopword(x))

In [None]:
len(dataset['text'][0])

In [None]:
len(dataset_copy['text'][0])

### Step 4.12: Stemming

In [None]:
from nltk.stem.porter import PorterStemmer

st=PorterStemmer()
stem=lru_cache(maxsize=50000)(st.stem)
def stemming_on_data(list_words):
  text=[stem(word) for word in list_words]

  return text

In [None]:
dataset['text']=dataset['text'].apply(lambda x: stemming_on_data(x))

In [None]:
dataset.head()

Unnamed: 0,text,target
0,"[tuser, awww, that, bummer, shoulda, got, davi...",0
1,"[upset, cant, updat, facebook, text, might, cr...",0
2,"[tuser, dive, mani, time, ball, manag, save, 5...",0
3,"[whole, bodi, feel, itchi, like, fire]",0
4,"[tuser, behav, im, mad, cant, see]",0


### Step 4.13: Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
def list_tosent(list_words):
  return ' '.join(list_words)

list_tosent(dataset['text'][0])

'tuser awww that bummer shoulda got david carr third day emojilaugh'

In [None]:
dataset['text']=dataset['text'].apply(lambda x: list_tosent(x))

In [None]:
lm=WordNetLemmatizer()
@lru_cache(maxsize=50000)
def lemmatization_on_data(list_words):
  list_word=list_words.split()
  text=[lm.lemmatize(word) for word in list_word]

  return text


In [None]:
dataset['text']=dataset['text'].apply(lambda x: lemmatization_on_data(x))

In [None]:
new_dataset=dataset.copy()

In [None]:
dataset['text']=dataset['text'].apply(lambda x: list_tosent(x))

In [None]:
dataset.head()

Unnamed: 0,text,target
0,tuser awww that bummer shoulda got david carr ...,0
1,upset cant updat facebook text might cri resul...,0
2,tuser dive mani time ball manag save 50 rest g...,0
3,whole bodi feel itchi like fire,0
4,tuser behav im mad cant see,0


## Step 5: Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(dataset['text'],dataset['target'],test_size=0.2,random_state=42)

tfidf=TfidfVectorizer(max_features=500000,ngram_range=(1,3),stop_words='english')

X_train_tfidf=tfidf.fit_transform(X_train)
X_test_tfif=tfidf.transform(X_test)

In [None]:
X_train_tfidf.shape

(1280000, 500000)
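To make the numbers inside this matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` documents for its default settings (raw term counts, smoothed idf, L2-normalized rows). It uses unigrams only and a hypothetical two-tweet corpus, so it is an illustration rather than a replacement for the vectorizer:

```python
import math
from collections import Counter

docs = ['good day', 'bad day']  # hypothetical two-tweet corpus

vocab = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)
# Document frequency: in how many documents each word appears
df = Counter(w for d in docs for w in set(d.split()))
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """tf-idf weights for one document, L2-normalized, in vocab order."""
    counts = Counter(doc.split())
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for d in docs:
    print(d, dict(zip(vocab, (round(v, 3) for v in tfidf_row(d)))))
```

The word `day` appears in both documents, so it gets a lower weight than `good` or `bad`, which each appear in only one.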

In [None]:
for i,f in enumerate(tfidf.get_feature_names_out()):
  print(i,f)

Streaming output truncated to the last 5000 lines.
495000 yeahw
495001 yeahwel
495002 yeahwer
495003 yeahyeah
495004 yeahyou
495005 yeahyour
495006 yeai
495007 yeaim
495008 yeait
495009 yeap
495010 yeap got
495011 yeap im
495012 yeap yeap
495013 year
495014 year 10
495015 year 11
495016 year 11 left
495017 year 12
495018 year 13
495019 year 1st
495020 year 2008
495021 year 2010
495022 year 2011
495023 year 2nd
495024 year 3000
495025 year 40
495026 year 40 year
495027 year activ
495028 year actual
495029 year afford
495030 year age
495031 year ago
495032 year ago amp
495033 year ago awesom
495034 year ago didnt
495035 year ago dont
495036 year ago fail
495037 year ago feel
495038 year ago good
495039 year ago got
495040 year ago great
495041 year ago havent
495042 year ago hope
495043 year ago horribl
495044 year ago im
495045 year ago laugh
495046 year ago long
495047 year ago love
495048 year ago make
495049 year ago miss
495050 year ago nice
495051 year ago realli
4950

## Step 6: Apply algorithm and Predict the Sentiment

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb_model=MultinomialNB()
nb_model.fit(X_train_tfidf,y_train)

y_pred=nb_model.predict(X_test_tfif)
print(accuracy_score(y_test,y_pred))

0.773321875
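Accuracy alone does not show whether the errors fall mostly on positive or on negative tweets. A minimal sketch of computing precision and recall for the positive class next to accuracy; the label lists below are hypothetical and `binary_report` is not part of the notebook:

```python
def binary_report(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels, only to show the call
print(binary_report([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```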


In [None]:


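def sentiment(list_of_tweets):
  # Classifies only the first tweet in the list; comparing the whole
  # prediction array to 1 would fail for more than one tweet
  new_tweet=tfidf.transform(list_of_tweets)
  if nb_model.predict(new_tweet)[0]==1:
    return 'Happy'

  else:
    return 'Unhappy'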

In [None]:
new_tweet=['i am unhappy']
sentiment(new_tweet)

'Unhappy'

In [None]:
def cleaner(text):
  pattern=re.compile(r'http[s]?:\/\/\S+')
  text= pattern.sub(r'',text)
  text=text.translate(str.maketrans("","",punc))

  return text



In [None]:
new=[(cleaner(new_tweet[0]))]

In [None]:
sentiment(new)

'Unhappy'
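`cleaner` repeats only two of the training-time steps (URL and punctuation removal), so new tweets may not look like the text the vectorizer was fitted on. A minimal sketch of chaining a few more of the steps from Step 4 in the same order; it still omits emoticon handling, slang expansion, stop-word removal, stemming and lemmatization, and `clean_for_inference` is a hypothetical name:

```python
import re
import string

URL = re.compile(r'https?://\S+')
MENTION = re.compile(r'@\S+')
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def clean_for_inference(text):
    """Apply URL removal, mention replacement, punctuation removal and lowercasing."""
    text = URL.sub('', text)
    text = MENTION.sub('Tuser', text)
    text = text.translate(PUNC_TABLE)
    return ' '.join(text.lower().split())

print(clean_for_inference('@bob I LOVE this!! http://t.co/xyz'))
```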

# Section 2: Sentiment Analysis Using RNN

## Step 2.1: Find the Unique Words

In [None]:
new_dataset.head()

Unnamed: 0,text,target
0,"[tuser, awww, that, bummer, shoulda, got, davi...",0
1,"[upset, cant, updat, facebook, text, might, cr...",0
2,"[tuser, dive, mani, time, ball, manag, save, 5...",0
3,"[whole, bodi, feel, itchi, like, fire]",0
4,"[tuser, behav, im, mad, cant, see]",0


In [None]:
words=set()

for data in new_dataset['text']:
  for word in data:
    words.add(word)


In [None]:
number_of_words=len(words)
number_of_words

396196

In [None]:
new_dataset['text']=new_dataset['text'].apply(lambda x: list_tosent(x))

In [None]:
# Save the processed tweets; index=False is a to_csv argument, not a read_csv one
new_dataset.to_csv('/content/drive/MyDrive/NLP/processed_tweets (1).csv',index=False)

## Step 2.2: Import Libraries and Data

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

In [None]:
max_features=396196     # number_of_words found in Step 2.1

In [None]:
new_dataset=pd.read_csv('/content/drive/MyDrive/NLP/processed_tweets (1).csv')
new_dataset.head(1)

In [None]:
new_dataset['text']=new_dataset['text'].astype('str')

In [None]:
(new_dataset['text']).head()

In [None]:
new_dataset['text'].values

In [None]:
tokenizer_keras=Tokenizer(num_words=max_features,split=' ')

In [None]:
tokenizer_keras.fit_on_texts(new_dataset['text'].values)

In [None]:
x=tokenizer_keras.texts_to_sequences(new_dataset['text'].values)
x

In [None]:
type(x)

In [None]:
x[:5]

In [None]:
new_dataset['text'][:5]

In [None]:
tokenizer_keras.word_index
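The `word_index` above assigns small integers to frequent words, with index 0 left unused. A minimal pure-Python sketch of that mapping on a hypothetical two-text corpus; the real `Tokenizer` also lowercases and filters punctuation by default, which this sketch skips, and tie-breaking between equally frequent words is an assumption here:

```python
from collections import Counter

texts = ['good day good', 'bad day']  # hypothetical corpus

counts = Counter(w for t in texts for w in t.split())
# Rank words by frequency; index 0 is left unused (reserved for padding)
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(batch, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    seqs = []
    for t in batch:
        ids = [word_index[w] for w in t.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        seqs.append(ids)
    return seqs

print(word_index)
print(texts_to_sequences(texts))
```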

In [None]:
y=pd.get_dummies(new_dataset['target']).values

In [None]:
y[:2]

## Step 2.3: Pad Sequences

In [None]:
len(x)

In [None]:
x=pad_sequences(x)

In [None]:
x[:5]

In [None]:
type(x)
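With its default arguments, `pad_sequences` pads with zeros at the front and, when a `maxlen` is given, also truncates from the front. A minimal pure-Python sketch of that behavior, assuming `maxlen` is at least 1; `pad_pre` is a hypothetical helper that returns nested lists rather than a NumPy array:

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        kept = list(s)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[1, 2, 1], [3, 2]]))
print(pad_pre([[1, 2, 3, 4]], maxlen=2))
```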

## Step 2.4: Split the Data

In [None]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
valid_size=240000
x_valid=x_test[-valid_size:]
y_valid=y_test[-valid_size:]
x_test=x_test[:-valid_size]
y_test=y_test[:-valid_size]
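A quick check of the resulting split sizes, assuming the reloaded CSV still holds all 1,600,000 rows (that row count is an assumption here, since the shapes are not shown above):

```python
n_total = 1_600_000      # assumed row count
test_percent = 30        # test_size=0.3 in train_test_split
valid_size = 240_000     # rows carved off the end of the held-out part

n_holdout = n_total * test_percent // 100
n_train = n_total - n_holdout
n_test = n_holdout - valid_size

print(n_train, valid_size, n_test)
```

Under that assumption the data ends up as 70% training, 15% validation and 15% test.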

## Step 2.5: Create the RNN Architecture

In [None]:
from keras.models import Sequential
from keras.layers import Dense,Embedding,SimpleRNN,SpatialDropout1D
from keras.optimizers import Adam
from keras.regularizers import L2


In [None]:
embed_dim=128

In [None]:
#to detect the TPU
tpu=tf.distribute.cluster_resolver.TPUClusterResolver.connect()

#Instantiate the TPU
tpu_strategy=tf.distribute.TPUStrategy(tpu)

with tpu_strategy.scope():
  model=Sequential()
  model.add(Embedding(max_features,embed_dim,input_length=x.shape[1]))
  model.add(SpatialDropout1D(0.4))
  model.add(SimpleRNN(196,dropout=0.2,recurrent_dropout=0.2))
  model.add(Dense(2,activation='softmax',kernel_regularizer=L2(0.001)))

  model.compile(loss='categorical_crossentropy',optimizer=Adam(learning_rate=0.0001),metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
from keras import callbacks

earlystopping=callbacks.EarlyStopping(monitor='val_loss',
                                      mode='min',
                                      patience=5,
                                      restore_best_weights=True)

model.fit(x_train,y_train,epochs=20,batch_size=512,verbose=1,
          validation_data=(x_valid,y_valid),
          callbacks=[earlystopping])
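The early-stopping rule configured above can be sketched in plain Python: training stops once the validation loss has failed to improve for `patience` consecutive epochs, and the best epoch is remembered so its weights can be restored. This is a simplified illustration with hypothetical loss values, not the Keras implementation:

```python
def early_stop_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based, for per-epoch validation losses."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, only to show the call
losses = [0.60, 0.55, 0.53, 0.54, 0.56, 0.55, 0.57, 0.58, 0.59]
print(early_stop_epoch(losses, patience=5))
```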