# Introduction
Hello community, welcome to this kernel. In this kernel we're going to discover how to classify texts using TF-IDF vectorization and fully connected PyTorch models.

**We'll use Torch based solutions and play the game by its rules :)**

In this kernel I didn't want to use RNNs and word embeddings because they're considerably harder, so we'll cover them in the next kernel.

So let's start!

In [None]:
import numpy as np
import pandas as pd
import re 

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data import Dataset



# Step 1: Preparing Dataset
In this step we're going to read our dataset, create our dataset class and prepare our dataloaders. 

In [None]:
data = pd.concat([pd.read_csv('../input/tweets-with-sarcasm-and-irony/train.csv'),
                  pd.read_csv("../input/tweets-with-sarcasm-and-irony/test.csv")],axis=0)


* Since the test set also has labels, there's no point keeping the two files separate while we prepare the data. So I concatenated them, and we'll make our own train/test split later.

In [None]:
data.info()

In [None]:
data.head()

In [None]:
np.unique(list(data["class"]))

* It seems like we have NaN values, let's remove them.

In [None]:
data.dropna(inplace=True)
data.info()
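
* To see where the missing values sit before dropping them, a per-column count is handy. Here is a minimal sketch on a made-up frame (the rows are invented for illustration; only the column names follow this dataset):

```python
import numpy as np
import pandas as pd

# Made-up rows, only for illustration.
toy = pd.DataFrame({
    "tweets": ["first tweet", None, "third tweet"],
    "class": ["irony", "sarcasm", np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
toy_clean = toy.dropna()
print(len(toy), "->", len(toy_clean))
```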

* Now it's okay. Let's write a class which will clean and vectorize our texts.

In [None]:
class Vectorizer():
    def __init__(self,clean_pattern=None,max_features=None,stop_words=None):
        self.clean_pattern = clean_pattern
        self.max_features = max_features
        self.stopwords = stop_words
        self.tfidf = TfidfVectorizer(stop_words=self.stopwords,max_features=self.max_features)
        self.builded = False
        
    
    def _clean_texts(self,texts):
        
        cleaned = []
        for text in texts:
            if self.clean_pattern is not None:
                text = re.sub(self.clean_pattern," ",text)
            
            text = text.lower().strip()
            cleaned.append(text)
        
        return cleaned
    
    
    def _set_tfidf(self,cleaned_texts):
        self.tfidf.fit(cleaned_texts)
    
    def build_vectorizer(self,texts):
        cleaned_texts = self._clean_texts(texts)
        self._set_tfidf(cleaned_texts)
        self.builded = True
        
    def vectorizeTexts(self,texts):
        if self.builded:
            cleaned_texts = self._clean_texts(texts)
            return self.tfidf.transform(cleaned_texts)
        
        else:
            raise Exception("Vectorizer is not built. Call build_vectorizer() first.")
            
            

* Now let's create an object from this class, then clean and vectorize our dataset.

In [None]:
x = list(data["tweets"])
y = list(data["class"])

In [None]:
vectorizer = Vectorizer("[^a-zA-Z0-9]",max_features=7000,stop_words="english")

In [None]:
vectorizer.build_vectorizer(x)

In [None]:
vectorized_x = vectorizer.vectorizeTexts(x).toarray()


In [None]:
vectorized_x.shape
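
* To make the cleaning and vectorizing steps concrete, here is a minimal sketch on three made-up tweets (the texts are invented, not taken from the dataset). It assumes the same regex and the same stop word setting as above, but leaves out `max_features` because the toy vocabulary is tiny.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example texts, only for illustration.
toy_texts = ["Oh great, another Monday!!! #sarcasm",
             "I love Mondays :)",
             "Great weather today, love it"]

# Same cleaning as in the Vectorizer class: non-alphanumerics become spaces,
# then the text is lowercased and stripped.
toy_cleaned = [re.sub("[^a-zA-Z0-9]", " ", t).lower().strip() for t in toy_texts]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_cleaned)

print(toy_cleaned[0].split())
print(sorted(toy_tfidf.vocabulary_))
print(toy_matrix.shape)   # (number of texts, vocabulary size)
```

Each row of the matrix is one tweet and each column is one word of the vocabulary, which is why the real matrix above has at most 7000 columns.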

* The texts are ready now, let's encode the classes.

In [None]:
label_map = {
    "figurative":0,
    "sarcasm":1,
    "irony":2,
    "regular":3
}


In [None]:
y_encoded = []
for y_sample in y:
    y_encoded.append(label_map[y_sample])
    
y_encoded = np.asarray(y_encoded)

In [None]:
y_encoded.shape
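
* The loop above raises a `KeyError` if a label is missing from `label_map`. As a hedged alternative, pandas' `Series.map` leaves unmapped labels as NaN so they can be inspected. The labels below are made up, and `inverse_label_map` is a helper name introduced here for illustration:

```python
import pandas as pd

label_map = {"figurative": 0, "sarcasm": 1, "irony": 2, "regular": 3}

# Made-up labels; "unknown" is deliberately not in the map.
toy_labels = pd.Series(["sarcasm", "regular", "unknown", "irony"])

# Unmapped labels become NaN instead of raising KeyError.
encoded = toy_labels.map(label_map)
unmapped = toy_labels[encoded.isna()]
print(unmapped.tolist())

# Inverse map, handy for turning predicted ids back into class names.
inverse_label_map = {idx: name for name, idx in label_map.items()}
print(inverse_label_map[2])
```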

* And we're ready to create our custom dataset class by inheriting from `Dataset`.

In [None]:
class TweetDataset(Dataset):
    
    def __init__(self,x_vectorized,y_encoded):
        self.x_vectorized = x_vectorized
        self.y_encoded = y_encoded
        
    
    def __len__(self):
        return len(self.x_vectorized)
    
    
    def __getitem__(self,index):
        return self.x_vectorized[index],self.y_encoded[index]
    
    

* It's really easy to implement a custom dataset, let's create an object and test it.

In [None]:
dataset = TweetDataset(vectorized_x,y_encoded)
print("Length of our dataset is",len(dataset))

print(dataset[2])
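
* Calling `.toarray()` on the whole TF-IDF matrix keeps every zero in memory. If that becomes a problem, one option is to keep the matrix sparse and densify a single row on demand. This is only a sketch: `SparseTweetDataset` is a hypothetical class name, and the data is a small stand-in for the real TF-IDF output.

```python
import numpy as np
import torch
from scipy import sparse
from torch.utils.data import Dataset

class SparseTweetDataset(Dataset):
    """Hypothetical variant of TweetDataset that stores a SciPy CSR matrix."""

    def __init__(self, x_sparse, y_encoded):
        self.x_sparse = sparse.csr_matrix(x_sparse)
        self.y_encoded = np.asarray(y_encoded)

    def __len__(self):
        # len() is not defined for sparse matrices, so use the row count.
        return self.x_sparse.shape[0]

    def __getitem__(self, index):
        # Densify one row only when it is requested.
        row = self.x_sparse[index].toarray().ravel().astype(np.float32)
        return torch.from_numpy(row), int(self.y_encoded[index])

# Stand-in data: 10 samples, 50 features, one non-zero value per row.
toy_dense = np.zeros((10, 50))
toy_dense[np.arange(10), np.arange(10)] = 1.0
toy_y = np.arange(10) % 4

sparse_dataset = SparseTweetDataset(sparse.csr_matrix(toy_dense), toy_y)
features, label = sparse_dataset[2]
print(len(sparse_dataset), features.shape, features.dtype, label)
```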

* To draw random batches from the train and test parts separately we need subset random samplers, so let's prepare them.

In [None]:
# Split the indices into train and test parts to use them in the subset samplers.
train_indices,test_indices = train_test_split(list(range(0,len(dataset))),test_size=0.25,random_state=42)

In [None]:
print(len(train_indices))
print(len(test_indices))
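
* The split above is purely random. If the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 80 samples of class 0 and 20 of class 1.
toy_labels = np.array([0] * 80 + [1] * 20)
toy_indices = list(range(len(toy_labels)))

strat_train_idx, strat_test_idx = train_test_split(
    toy_indices, test_size=0.25, random_state=42, stratify=toy_labels)

# With stratify, both parts keep the 80/20 class ratio.
print(np.bincount(toy_labels[strat_train_idx]))
print(np.bincount(toy_labels[strat_test_idx]))
```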

In [None]:
train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(test_indices)


* Our dataset and samplers are ready, so we can create our data loader objects and start to model our artificial neural network.

In [None]:
BATCH_SIZE = 128
train_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, 
                                           sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE,
                                                sampler=test_sampler)
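
* To see how a loader and a sampler work together, here is a small sketch with random stand-in data. The `toy_` names and sizes are made up for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Random stand-in data: 20 samples, 5 features, 4 classes.
toy_dataset = TensorDataset(torch.rand(20, 5), torch.randint(0, 4, (20,)))
toy_train_idx, toy_val_idx = list(range(0, 15)), list(range(15, 20))

# Both loaders share one dataset; the sampler decides which indices each one sees.
toy_train_loader = DataLoader(toy_dataset, batch_size=8,
                              sampler=SubsetRandomSampler(toy_train_idx))
toy_val_loader = DataLoader(toy_dataset, batch_size=8,
                            sampler=SubsetRandomSampler(toy_val_idx))

batch_sizes = [len(targets) for _, targets in toy_train_loader]
print(batch_sizes)                   # the last batch is smaller when the subset size is not a multiple of batch_size
print(len(toy_val_loader.sampler))   # number of samples in the validation subset
```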

# Step 2: Building Neural Network Architecture

In this section we're going to create a custom network class that inherits from nn.Module. After building this simple neural network we'll create the loss function and the optimizer.

In [None]:
class DenseNetwork(nn.Module):
    
    def __init__(self):
        super(DenseNetwork,self).__init__()
        self.fc1 = nn.Linear(7000,1024)
        self.drop1 = nn.Dropout(0.4)
        self.fc2 = nn.Linear(1024,256)
        self.drop2 = nn.Dropout(0.4)
        self.prediction = nn.Linear(256,4)
        
    def forward(self,x):
        
        x = F.relu(self.fc1(x.to(torch.float)))
        x = self.drop1(x)
        x = F.relu(self.fc2(x))
        x = self.drop2(x)
        x = F.log_softmax(self.prediction(x),dim=1)
        
        return x
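
* A quick way to sanity-check a network like this is to push a random batch through it and look at the output. The sketch below uses a scaled-down stand-in (`TinyDenseNetwork`, with hypothetical layer sizes) so it runs instantly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenseNetwork(nn.Module):
    """Scaled-down stand-in for DenseNetwork (hypothetical sizes)."""

    def __init__(self, n_features=20, n_classes=4):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 8)
        self.drop1 = nn.Dropout(0.4)
        self.prediction = nn.Linear(8, n_classes)

    def forward(self, x):
        x = self.drop1(F.relu(self.fc1(x.to(torch.float))))
        return F.log_softmax(self.prediction(x), dim=1)

tiny_model = TinyDenseNetwork()
tiny_model.eval()                      # turn dropout off so the output is deterministic
with torch.no_grad():
    log_probs = tiny_model(torch.rand(3, 20))

print(log_probs.shape)                 # one row per sample, one column per class
print(log_probs.exp().sum(dim=1))      # each row of probabilities sums to about 1
```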

* And our small and lovely neural network is ready. Before creating a model object let's define our device (GPU if one is available, otherwise CPU).

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

* And now it's time for creating the model.

In [None]:
model = DenseNetwork().to(device)


* And now we'll declare our criterion (loss) and optimizer.

In [None]:
# The model already returns log-probabilities (log_softmax), so NLLLoss is the matching loss.
criterion = nn.NLLLoss()
optimizer = optim.RMSprop(model.parameters(),lr=1e-3)
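
* Since the network ends in `log_softmax`, it helps to know how PyTorch's two classification losses treat that output. `nn.CrossEntropyLoss` applies log-softmax internally, while `nn.NLLLoss` expects log-probabilities. A small check on random numbers (not on the tweet data):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 4)                   # random scores: 5 samples, 4 classes
targets = torch.randint(0, 4, (5,))
log_probs = F.log_softmax(logits, dim=1)

ce_on_logits = nn.CrossEntropyLoss()(logits, targets)
nll_on_log_probs = nn.NLLLoss()(log_probs, targets)
# Applying log-softmax to log-probabilities leaves them unchanged,
# so this gives the same value as the two losses above.
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)

print(ce_on_logits.item(), nll_on_log_probs.item(), ce_on_log_probs.item())
```

All three values agree up to floating point error, so the choice between the two losses here is about readability and not about the result.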


# Step 3: Training The Neural Network
Our model and dataset are ready, so in this section we're going to train our model.

In [None]:
EPOCHS = 6
TRAIN_LOSSES = []
TRAIN_ACCURACIES = []

model.train()  # make sure dropout is active during training
for epoch in range(1,EPOCHS+1):
    epoch_loss = 0.0
    epoch_true = 0
    epoch_total = 0
    for data_,target_ in train_loader:
        data_ = data_.to(device)
        target_ = target_.to(device)
        
        # Resetting the gradients accumulated in the previous step.
        optimizer.zero_grad()
        
        # Forward propagation
        outputs = model(data_)
        
        # Computing loss & backward propagation
        loss = criterion(outputs,target_)
        loss.backward()
        
        # Applying gradients
        optimizer.step()
        
        epoch_loss += loss.item()
        
        _,pred = torch.max(outputs,dim=1)
        epoch_true = epoch_true + torch.sum(pred == target_).item()
        
        epoch_total += target_.size(0)
        
    TRAIN_LOSSES.append(epoch_loss)
    TRAIN_ACCURACIES.append(100 * epoch_true / epoch_total)
    
    print(f"Epoch {epoch}/{EPOCHS} finished: train_loss = {epoch_loss}, train_accuracy = {TRAIN_ACCURACIES[epoch-1]}")
    
        
        

# Step 4: Testing Model
We've trained our model, so now it's time to evaluate it on the held-out test split.

In [None]:
test_true = 0
test_total = len(test_sampler)
test_loss = 0.0
model.eval()  # switch dropout off, otherwise predictions are randomly perturbed
with torch.no_grad():
    for data_,target_ in validation_loader:
        data_,target_ = data_.to(device),target_.to(device)
        
        outputs = model(data_)
        
        loss = criterion(outputs,target_).item()
        
        _,pred = torch.max(outputs,dim=1)
        
        test_true += torch.sum(pred==target_).item()
        test_loss += loss
        

print(f"Validation finished: Accuracy = {round(100 * test_true / test_total,2)}%, Loss = {test_loss}")
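
* Accuracy alone doesn't show which classes get mixed up. `confusion_matrix` and `accuracy_score` from scikit-learn can do that once the true labels and predictions are collected as arrays. The sketch below uses made-up values for both; in the loop above they could be gathered per batch with something like `target_.cpu().numpy()` and `pred.cpu().numpy()`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, standing in for values collected during validation.
toy_true = np.array([0, 1, 2, 3, 1, 1, 2, 0])
toy_pred = np.array([0, 1, 1, 3, 1, 2, 2, 3])

toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2, 3])
toy_acc = accuracy_score(toy_true, toy_pred)

print(toy_cm)     # rows are true classes, columns are predicted classes
print(toy_acc)    # share of correct predictions
```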

# Conclusion

Hey! We've finished this kernel and discovered how to use PyTorch and TF-IDF together. Validation accuracy might seem low, but I think that's mostly because of our simple data processing. With better preprocessing it would likely improve.

If you have a question about this, please ask me in the comment section of this kernel and also mention me because I generally can't see them if you don't mention me.

Have a good day/night and if you liked this kernel, please upvote to support and motivate me :)