## DataLoader and DataSets

- Whenever we pass data to a model, we pass it in batches (mini-batch SGD) rather than all at once, so that each training step needs less memory and computation.

Dataset size = 100, batch size = 20 -> 100 / 20 = 5 batches
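
The same arithmetic as a small sketch, including the case where the dataset size is not a multiple of the batch size. The helper name `num_batches` is made up for illustration; the `drop_last` flag mirrors the `DataLoader` option of the same name.

```python
import math

def num_batches(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches produced by one pass over the data."""
    if drop_last:
        # The incomplete final batch is discarded.
        return n_samples // batch_size
    # Otherwise the final batch is simply smaller.
    return math.ceil(n_samples / batch_size)

print(num_batches(100, 20))                  # 5
print(num_batches(10, 4))                    # 3 (batches of 4, 4, and 2 samples)
print(num_batches(10, 4, drop_last=True))    # 2
```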

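#### Mini-batch Gradient Descent
- Memory efficient: only one batch has to be held in memory at a time
- Often better convergence: several weight updates per epoch instead of one, with less noise than single-sample updates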
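
A minimal sketch of mini-batch gradient descent in plain Python, assuming a toy one-variable linear regression with a mean-squared-error loss. The function name, learning rate, epoch count, and data are all illustrative choices, not part of the notebook.

```python
import random

def minibatch_gd(xs, ys, batch_size=20, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the batch's mean squared error w.r.t. w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # One parameter update per mini-batch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data: 100 noise-free points on the line y = 2x + 1.
xs = [i / 100 for i in range(100)]
ys = [2 * x + 1 for x in xs]

w, b = minibatch_gd(xs, ys)
print(round(w, 3), round(b, 3))  # should be close to 2 and 1
```

With 100 samples and a batch size of 20, each epoch performs 5 updates, whereas full-batch gradient descent would perform only one.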

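### Dataset -> Stores the data and defines how to access one sample
- `__init__()`: store the features and labels
- `__len__()`: return the number of samples
- `__getitem__()`: return the (features, label) pair at a given index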

### DataLoader -> Splits a Dataset into batches (optionally shuffled)
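
Conceptually, a loader only needs `__len__` and `__getitem__` from the dataset. The sketch below is a rough stand-in in plain Python, not PyTorch's actual implementation; `iterate_batches` and the sample data are hypothetical. The real `DataLoader` additionally collates each batch into tensors and can load data with worker processes.

```python
import random

def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of samples, batch by batch."""
    indices = list(range(len(dataset)))          # relies on __len__
    if shuffle:
        random.Random(seed).shuffle(indices)     # visit samples in random order
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [dataset[i] for i in chunk]        # relies on __getitem__

# A plain list already supports len() and indexing, so it can act as a dataset here.
samples = [(f"x{i}", i % 2) for i in range(10)]

batches = list(iterate_batches(samples, batch_size=4, shuffle=True, seed=0))
for batch in batches:
    print(batch)
```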


In [1]:
from sklearn.datasets import make_classification 
import torch 

In [13]:
X, y = make_classification(
    n_classes=2,
    n_samples=10,
)

In [3]:
X

array([[-1.46930952, -0.51497568, -1.30086439,  0.73670307,  1.56182039,
         0.44285738, -0.58880513,  0.00645404,  0.85454259, -0.34811455,
         0.50807647,  0.37606431,  0.13186248, -0.58209252,  0.60098358,
        -1.34369256,  0.65292527, -1.16981983, -0.80594645,  0.83935654],
       [ 0.17656379,  0.89082523,  0.49459966,  0.72305362, -0.27583923,
         1.63250722, -1.42186842, -2.509238  , -0.29409555,  1.01560672,
        -0.28899211,  1.40538456, -1.07061075, -1.25630089, -0.92631494,
        -0.21765768,  0.2677878 , -0.14897068,  1.77968797,  1.03713925],
       [-1.17591226,  1.05949794,  1.08038682,  0.37089239,  0.06218165,
         0.43026521,  0.50641511, -2.01603235,  0.54936101,  0.88622384,
        -1.26766631,  1.16759521, -1.70193914, -0.05616054, -0.19514072,
         0.22667736,  1.59890752,  1.12114939,  1.59279811,  0.66209974],
       [-1.00343192,  1.08233792,  0.43348499,  1.24291002, -0.20513748,
        -0.16891171,  0.83927518, -0.08485098, -

In [4]:
y

array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

In [6]:
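# Convert the NumPy arrays to tensors; float32 is the default dtype of PyTorch layers
# (dtype=float would produce float64 instead)
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)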

In [7]:
from torch.utils.data import Dataset, DataLoader

In [20]:
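class CustomDataset(Dataset):
    """Wraps a feature matrix and its labels so they can be indexed sample by sample."""

    def __init__(self, features, labels):
        self.features = features  # shape: (n_samples, n_features)
        self.labels = labels      # target column, shape: (n_samples,)

    def __len__(self):
        return self.features.shape[0]  # number of samples (rows)

    def __getitem__(self, index):
        # Return one (features, label) pair for the given row index
        return self.features[index], self.labels[index]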

In [21]:
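# The raw NumPy arrays are passed here; DataLoader's default collation converts each batch to tensors
dataset = CustomDataset(X, y)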

In [22]:
dataset

<__main__.CustomDataset at 0x1e92d12ff80>

In [23]:
len(dataset)

10

In [19]:
dataset[2] ## 3rd row

(array([-1.17591226,  1.05949794,  1.08038682,  0.37089239,  0.06218165,
         0.43026521,  0.50641511, -2.01603235,  0.54936101,  0.88622384,
        -1.26766631,  1.16759521, -1.70193914, -0.05616054, -0.19514072,
         0.22667736,  1.59890752,  1.12114939,  1.59279811,  0.66209974]),
 np.int64(1))

In [24]:
dataset[9] # 10th row

(array([-0.17327744, -1.72436486, -0.15678134, -0.99263365, -0.67281922,
        -0.5741705 ,  0.32259687,  0.52925581, -1.16039618, -0.14840071,
        -1.02753022,  0.85893047,  0.22185898, -0.89442988,  0.12689251,
         0.65573562,  0.13768309,  1.09918242, -0.04560399, -1.47288108]),
 np.int64(0))

In [30]:
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
dataloader

<torch.utils.data.dataloader.DataLoader at 0x1e92f403f80>

In [31]:
for batch_features, batch_labels in dataloader:
    print(batch_features)
    print(batch_labels)
    print("="*50)

tensor([[ 1.2142,  0.2598, -1.9787, -0.3066,  0.1403, -0.0091, -0.2166, -1.5985,
         -1.0666,  0.0252, -1.3356,  1.2364, -0.5678,  0.7129,  1.2817,  0.5175,
         -0.3770, -0.1548, -0.4983,  0.9954],
        [ 0.3167, -0.4107, -1.1055, -0.5088, -0.7381,  1.6442,  0.0383, -0.9555,
          1.2027,  0.4919, -1.2454,  0.5074, -0.2720,  0.1841,  0.6591, -0.3227,
         -1.3143,  0.1847, -1.5457, -1.9082]], dtype=torch.float64)
tensor([1, 1])
==================================================
tensor([[-0.6067, -0.1281, -0.9644,  0.0223, -2.2925,  1.8598, -0.7012, -1.7411,
         -0.7923,  0.5471, -0.5301,  0.1605, -0.7072,  0.1997, -1.5874, -0.1912,
         -1.3584, -1.6682,  1.0302, -0.7099],
        [-0.3375, -0.7183, -0.4378,  0.2005,  2.6039,  0.7262,  0.2609, -1.1909,
          0.6945,  0.1819,  1.7536, -1.3416,  2.0532,  1.0339, -1.3076, -0.5238,
         -0.0950,  0.2584, -1.9072,  0.9188]], dtype=torch.float64)
tensor([1, 1])
==================================================
tensor([[ 1.0376,  1.1867, -0.6169,  1.0821, -0.6547,  0.8427,  2.0917,  0.0985,
          1.2
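
When the data is already held in tensors, PyTorch's built-in `TensorDataset` can replace a hand-written dataset class. The sketch below assumes random stand-in data rather than the `make_classification` output, and uses a batch size of 4 to show what happens when the dataset size is not a multiple of the batch size.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Stand-in data: 10 samples with 20 features each (float32 by default).
features = torch.randn(10, 20)
labels = torch.randint(0, 2, (10,)).float()

# TensorDataset indexes all tensors along their first dimension.
tensor_dataset = TensorDataset(features, labels)

loader = DataLoader(tensor_dataset, batch_size=4, shuffle=True)
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(len(loader), batch_shapes)   # 3 batches; the last one holds the 2 leftover samples

# drop_last=True discards the incomplete final batch.
loader_drop = DataLoader(tensor_dataset, batch_size=4, shuffle=True, drop_last=True)
print(len(loader_drop))            # 2
```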

## Sentiment Analysis

1. Load the sentiment analysis data (test.csv).
2. Preprocess the text.
3. Keep two columns: Text and sentiment.
4. Encode or embed the text data, and label-encode the target column.
5. Write an RNN in PyTorch to train a sentiment analysis model.
6. Test the model on a random example.
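
One possible shape for steps 2 to 4, sketched in plain Python. The rows below are made-up placeholders, not the contents of test.csv, and the helper names (`tokenize`, `build_vocab`, `encode`), the `<pad>`/`<unk>` tokens, and `max_len=5` are assumptions chosen for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(texts):
    """Map each token to an integer id; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids (truncate, then pad with 0)."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Placeholder rows standing in for the Text and sentiment columns.
rows = [
    ("I love this movie", "positive"),
    ("Worst film ever", "negative"),
    ("Not bad at all", "positive"),
]
texts = [text for text, _ in rows]

# Label encoding: sorted class names -> consecutive integers.
label_to_id = {label: i for i, label in enumerate(sorted({s for _, s in rows}))}

vocab = build_vocab(texts)
encoded = [encode(text, vocab, max_len=5) for text in texts]
targets = [label_to_id[s] for _, s in rows]

print(encoded)
print(targets)
```

The padded id lists can then be wrapped in tensors and served through a `Dataset` and `DataLoader` as in the previous section; in PyTorch, an `nn.Embedding` layer with `padding_idx=0` is a common way to feed such ids into an RNN.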

In [None]:
##